Scalable associative text mining network and method

ABSTRACT

A text mining network that improves the performance of search engines by using a network of computer entities with autonomous neural networks. Each neural network provides a weighted list of associated search terms for each search query. The lists of associated search terms from two or more computer entities are merged into a unique list of associated search terms by means of a virtual index algorithm. Document result sets from the autonomous entities are merged into a unique result set by a weighted combination of two or more result sets.

BACKGROUND

1. Field of the Invention

The present invention relates generally to systems and methods for computer-based searching.

2. Background of the Invention

Full-text searching of unstructured and semi-structured data is becoming more and more important in the world of computing. The amount of information available through the World Wide Web is voluminous. As of 2006, Google used 2 petabytes of disk space (en.wikipedia.org/wiki/Petabyte). To improve the accuracy of web and desktop searches, efforts have been invested in improving page ranking in terms of relevance. Despite these efforts, however, a large gap still exists between the results returned by a search and the results desired by a user.

Currently, advanced efforts involve the utilization of a network of computer processors for the calculation of document result sets, which calculation is based on the interconnectivity of documents (“The Anatomy of a Large-Scale Hypertextual Web Search Engine”; Sergey Brin et al.; 2000, pp. 1-29). The performance of the system is optimized by re-ranking the results (see U.S. application Ser. No. 10/351,316, filed on Jan. 27, 2003 by Krishna Bharat).

A drawback of this method is that only documents that are frequently cited are found. Thus, relevant information from new or special sources is ignored. For example, a self-help group that provides relevant information on a private home page does not have any “Pagerank” until the home page is cited.

Search routines can also be based on novel approaches in the field of artificial intelligence that utilize so-called artificial neural networks (ANN) in order to provide search results that consider semantic concepts and correlations and/or associations between terms and documents. One drawback of previously known neural network solutions for text mining is that the software only operates on a single computer, and thus is limited to a certain amount of data.

Several ANNs utilize unsupervised clustering algorithms. Unsupervised clustering algorithms fall into hierarchical or partitional paradigms. In general, similarities between all pairs of documents must be determined, thus making these approaches un-scalable.

In order to overcome the above-discussed drawbacks, it is desirable that the document result sets of a search engine be calculated based on the content of websites (and/or documents). Moreover, a need exists for a scalable neural network architecture that allows the generation of neural networks on a scale such that even petabytes of data can be computed.

Thus, a need remains for additional optimization techniques that use distributed neural networks and virtual indexes.

BRIEF SUMMARY OF THE INVENTION

In one embodiment of the present invention, a text mining system comprises a server device configured to receive search queries from one or more clients and a network of autonomous computer entities each linked to the server and each configured to provide documents and associative queries based on an initial search query. The server is preferably configured to assign relevance scores to the documents based on neural network algorithms that are implemented on each computer entity, assign a virtual index value to each associated query, calculate a merged associative query based on a weighted union of virtual index values received from two or more computer entities, and calculate a document result set based on the weighted union of document result sets from two or more computer entities.

In an embodiment of the present invention, a method for providing to a user a document result set based on a search inquiry comprises receiving a search request at a server device; forwarding the search request to a plurality of computer devices; and receiving, from each computer device, a plurality of associated terms related to the search request, a virtual index value for each associated term, and a weighted relevance of each associated term determined by that computer device. The method further comprises assigning a document weight factor for each computer device, ranking terms received from the plurality of computer devices based on the document weight factor of each computer device, and providing to the user a ranked set of terms based on the ranking of the terms.

Preferably, the systems and methods of the present invention can be used in combination with a conventional internet search engine such as Google (a product of Google, Inc.) or Yahoo (a product of Yahoo, Inc.) to increase the usefulness of the results of the search procedure.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram illustrating an exemplary system in which concepts consistent with the present invention may be implemented.

FIG. 2 is a flow chart illustrating an algorithm providing a virtual index.

FIG. 3 illustrates the calculation of the document weight factors.

FIG. 4 is a tabular comparison of biological neural networks and the text mining network.

DETAILED DESCRIPTION OF THE INVENTION

The following detailed description of the invention refers to the accompanying drawings. The detailed description does not limit the invention. Instead, the scope of the invention is defined by the appended claims and equivalents.

FIG. 1 is a diagram illustrating an exemplary network system in which concepts/methods/procedures in accordance with the present invention may be implemented. The network system includes multiple computer devices (e.g., 103 and 104), a server device 101 (which can comprise more than one server, but is shown as a single computer for clarity), and external clients 106 (shown, for simplicity, as a single computer, but intended to represent more generally a plurality of users that can each send an independent search query) that provide the search queries to the text mining system. Client 106 may be, for example, a PC of an internet user, or a workplace PC of a newspaper archive. Connection 107 transfers the search query to the server 101 and returns a result set based on the search query to the user 106.

The computer devices 103, 104 each include a random access memory coupled to a processor (not shown). Each processor performs a neural network algorithm suitable for text mining. An example of a text mining neural network is described in U.S. application Ser. No. 10/141,298, filed on May 8, 2002 by Doug Leno et al. (“Neural Network Feedback For Enhancing Text Search”). A further example is described in U.S. application Ser. No. 10/451,188, filed on Dec. 14, 2001 by David Gillespie (“Method of Document Searching”).

Each computer device 103, 104 is assigned a certain document repository. For example, the documents from a newspaper archive could be divided into six parts according to the publication years of the newspaper issues: Part 1: 2000-2007, Part 2: 1990-1999, Part 3: 1980-1989, etc.

Thus, in one implementation of the invention, computer device #1 (103) is assigned Part 1, computer device #2 (104) is assigned Part 2, etc.

Thus, each computer device is an autonomous search engine. Server 101 transfers the search query to the computer devices 103, 104 via network connections (102). The network may be formed based on standard technology such as switches or hubs.

In a preferred embodiment of the present invention, each computer device 103, 104, etc. provides associated terms for each search query. Associated terms are calculated based on a statistical thesaurus (US-application US1996000616883 filed on Mar. 15, 1996 by David James Miller et al.) or on a neural network algorithm (e.g. Kohonen mapping). For example, the search query “Apollo” could be associated with the following set of results:

Result from computer device #1: Armstrong, Eagle, moon, lunar, spacecraft

Result from computer device #2: Armstrong, Aldrin, Houston, moon, etc.

The set of results collected from computer device #1 is merged using a virtual index. The set of results collected from computer device #2 is likewise merged using a virtual index, etc.

In our preferred embodiment, the computer devices 103, 104 do not share a common index. For example, the word “moon” may be assigned the number 1767 on the local index of computer device 103 and the number 567456 on computer device 104.

In order to identify identical words provided by different computer devices, a virtual index is calculated. FIG. 2 is a flow chart illustrating a method that can be performed on a computer or other processing device for providing a virtual index, in accordance with an embodiment of the present invention.

For each word (i.e. string), a set of features to be indexed can be defined. For example, calculation of an index value for words using a three-feature virtual indexing system proceeds according to the steps illustrated in FIG. 2.

In step 202, a word is received by the system; for example, a word search is input by user 106 and sent to server 101. When a word is received in step 202, a routine, program or algorithm performed by the system, such as server 101, assigns an initial index value (number) (shown in FIG. 2 as “1”) of zero to the word.

In step 204, the server determines whether the character “e” is contained in the word. If so, the process moves to step 206, where the current index value (1) is increased by one before the process continues to step 208. If no “e” is present, the process moves directly to step 208.

In step 208, the system determines whether the letter “t” is contained in the word. If so, the process moves to step 210, where the current index value (current value of 1) is increased by two before the process continues to step 212. If no “t” is present, the process moves directly to step 212.

In step 212, the system determines whether the length of the word (string) is greater than five. If so, the process moves to step 214, where the current index value is increased by four. If not, the process moves to the end.

In the example of FIG. 2, a three-bit number is assigned to the string. The word “Armstrong” would be assigned the number 6 (0+2+4), the word “Eagle” the number 1 (1+0+0), and the word “moon” would be assigned the number 0.

Typically, under such a scheme, the virtual index number for a given word is not unique. For example, the words “Apollo” and “Aldrin” would be assigned the same index number 4.
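The three-feature scheme of FIG. 2 can be sketched in a few lines of Python. This is only an illustrative sketch and not the implementation of the invention: the function name virtual_index is hypothetical, and the choice to ignore letter case when checking for “e” and “t” is an assumption that the description does not state.

```python
def virtual_index(word: str) -> int:
    """Three-feature virtual index, following the steps described for FIG. 2."""
    text = word.lower()   # assumption: feature checks ignore letter case
    index = 0             # step 202: initial index value of zero
    if "e" in text:       # steps 204/206: the word contains "e" -> add 1
        index += 1
    if "t" in text:       # steps 208/210: the word contains "t" -> add 2
        index += 2
    if len(text) > 5:     # steps 212/214: length greater than five -> add 4
        index += 4
    return index


if __name__ == "__main__":
    for term in ["Armstrong", "Eagle", "moon", "Apollo", "Aldrin"]:
        print(term, virtual_index(term))
```

Because each feature contributes one bit, every computer device can compute the same value for the same string without consulting a shared index, at the cost of collisions such as “Apollo” and “Aldrin”.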

In a preferred embodiment of the present invention, a 24-bit number is used for providing a virtual index number to a search word. Thus, the system performs a check using 24 different features of the string. It can be shown, based on a German newspaper archive, that a “nearly” unique virtual index can be defined using the 24-bit virtual index.

The advantage of using the virtual index procedure as described above is that the index number can be computed very fast. If the text mining network has to compute a huge amount of data (>1 petabyte), typically millions of different words will aggregate to the local index on each computer device.

In a preferred embodiment of the present invention, each associated term received from each computer device is weighted according to a relevance ranking. This is a typical feature of statistical thesauri or neural networks.

For example, as illustrated below, each computer device provides a list of virtual index IDs and weights, expressed as a percent, for each term:

Computer device #1:

Associated term #1: “Armstrong”: Virtual index 6, relevance 100% (R1,0 = 1.00)

Associated term #2: “Eagle”: Virtual index 1, relevance 80% (R1,4 = 0.80)

Associated term #3: “moon”: Virtual index 0, relevance 50% (R1,5 = 0.50) etc.

Computer device #2:

Associated term #1: “Armstrong”: Virtual index 6, relevance 100% (R2,0 = 1.00)

Associated term #2: “Aldrin”: Virtual index 4, relevance 92% (R2,6 = 0.92)

Associated term #3: “Eagle”: Virtual index 1, relevance 41% (R2,4 = 0.41) etc.

In a preferred embodiment of the present invention, server 101 calculates and stores an array of ranking scores for each virtual index entry. Each computer device 103, 104 provides a partial sum of relevance values. Thus, the calculation of the relevance of the resulting associated terms is a simple addition of local result sets, which can be performed very rapidly. Moreover, a seamless and scalable integration of huge amounts of data can be realized.
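The addition of local result sets can be illustrated with the example values listed above for computer devices #1 and #2. The following Python sketch is an assumption about how such a merge might look, not the implementation used by server 101; the function name merge_associations and the representation of each result as a (virtual index, relevance) pair are hypothetical.

```python
from collections import defaultdict


def merge_associations(device_results):
    """Sum the relevance values per virtual index across all devices."""
    merged = defaultdict(float)
    for results in device_results:
        for vindex, relevance in results:
            merged[vindex] += relevance  # partial sums are simply added
    # Highest combined relevance first.
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)


# (virtual index, relevance) pairs taken from the example lists above.
device_1 = [(6, 1.00), (1, 0.80), (0, 0.50)]  # Armstrong, Eagle, moon
device_2 = [(6, 1.00), (4, 0.92), (1, 0.41)]  # Armstrong, Aldrin, Eagle

merged = merge_associations([device_1, device_2])

if __name__ == "__main__":
    for vindex, score in merged:
        print(vindex, round(score, 2))
```

In this sketch the server only ever compares virtual index values, so the differing local index numbers of the devices never need to be exchanged.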

Relative Ranking of Computer Devices

In a preferred embodiment of the present invention, each computer device can be assigned a relevance value. One example is the value of the information content of the assigned document repository. The information content I is calculated as the logarithm of the number Nk of different words that represent the local index of computer device k: I=log(Nk).

If Rk,j is the relevance of virtual index term j on computer device k, then

C=log(Nk)*Rk,j

describes the contribution to the overall result set calculated by the server 101.
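As a hedged illustration of the formula C=log(Nk)*Rk,j, the following Python sketch computes the contribution of one term on one device. The base of the logarithm is not stated in the description, so the natural logarithm is assumed here; the function names and the local index sizes in the example are hypothetical.

```python
import math


def information_content(num_distinct_words: int) -> float:
    """I = log(Nk); the natural logarithm is an assumption."""
    return math.log(num_distinct_words)


def contribution(num_distinct_words: int, relevance: float) -> float:
    """C = log(Nk) * Rk,j for one term j on one computer device k."""
    return information_content(num_distinct_words) * relevance


if __name__ == "__main__":
    # Hypothetical local index sizes for two devices, same local relevance.
    print(contribution(1_000_000, 0.80))
    print(contribution(10_000, 0.80))
```

With equal local relevance, the device with the larger local vocabulary contributes more to the merged result, which is the intended effect of weighting by information content.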

In one embodiment of the present invention, more than one server 101 is integrated in the network and partial sums can be calculated. Moreover, the relative ranking can be performed by more sophisticated functions f that utilize the information content I as a parameter, f=f(I).

Relative Ranking of Documents

In accordance with embodiments of the present invention, each computer entity is an autonomous search engine. Thus, each computer entity n delivers a list of relevant documents Ln,m according to the initial search query. Relevance ranking is usually provided on a percentage scale, i.e. the most relevant document is assigned the relevance 100%.

In order to define the relative ranking of documents that are provided by different computer entities, each computer entity n is assigned a document weight factor Dn (FIG. 3).

FIG. 3 illustrates features involved in a process for the computation of document weight factors in accordance with an embodiment of the present invention. The user 300 types the query “Apollo”. Two different computers with different document repositories (e.g., 103 and 104 in FIG. 1) deliver different associated terms. Computer device #1 delivers the associated terms “Armstrong”, “Eagle” and “moon” with different absolute (Rj,k_abs) and relative relevance values. Computer device #2 delivers the associated terms “Armstrong”, “Aldrin” and “Eagle”.

The document weight factor Dn of computer device n is defined as the maximum absolute value max(Rj,k_abs) of the relevance of associated terms (FIG. 3).

The re-ranking is based on the local document set ranking Fm,n on a percentage scale for the documents m on computer device n:

NewScore(m,n)=Dn*Fm,n.

In the example of FIG. 3, the most relevant document on computer device #1 will be assigned the new ranking score 6.67*1.00=6.67.

Thus, the most relevant document on the entire text mining network has the ranking score given by Maxscore=max(NewScore(m,n)).
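The re-ranking described above can be sketched as follows. This is a minimal sketch under stated assumptions: only the value 6.67 for computer device #1 is taken from the example of FIG. 3, while all other absolute relevance values, the local rankings Fm,n, the document names and the function names are hypothetical and serve only to make the example runnable.

```python
def document_weight_factor(abs_relevances):
    """Dn: maximum absolute relevance among a device's associated terms."""
    return max(abs_relevances)


def rerank(devices):
    """Return (NewScore, device, document) tuples, best score first."""
    scored = []
    for device_id, (abs_relevances, local_ranking) in devices.items():
        d_n = document_weight_factor(abs_relevances)
        for doc_id, f_mn in local_ranking.items():
            scored.append((d_n * f_mn, device_id, doc_id))  # NewScore(m,n)=Dn*Fm,n
    return sorted(scored, key=lambda entry: entry[0], reverse=True)


# device -> (absolute term relevances, local document ranking on a 0..1 scale)
devices = {
    1: ([6.67, 5.30, 3.30], {"doc_a": 1.00, "doc_b": 0.60}),  # 6.67 from FIG. 3
    2: ([4.20, 3.90, 1.70], {"doc_c": 1.00, "doc_d": 0.75}),  # hypothetical
}

ranking = rerank(devices)
max_score = ranking[0][0]  # Maxscore=max(NewScore(m,n))

if __name__ == "__main__":
    for score, device_id, doc_id in ranking:
        print(round(score, 2), device_id, doc_id)
```

Because each device ranks its own documents on a percentage scale, multiplying by Dn is what makes scores from different devices comparable before the final sort.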

Preferably, state-of-the-art sorting algorithms can be used to provide a complete sorted list of relevant documents.

In a preferred embodiment of the present invention, documents are stored locally at each computer device that can be addressed via a given IP (Internet Protocol) address. For example, computer device #n may be assigned the internal IP-address 192.168.0.n. The local document m of computer device n can be obtained via TCP/IP (e.g., http://192.168.0.n:PORT/document=m).

Suitable software for this purpose is the Kinkadee® Server provided by Kinkadee® Systems GmbH, Bocholt (Germany).

Principles of Neuronal Preprocessing

An important concept that underpins the scalable associative text mining network of the present invention is neuronal preprocessing, which can be explained using the following examples. The election of the President of the United States and the Vice President of the United States is indirect. Presidential electors are selected on a state-by-state basis as determined by the laws of each state. All states except Nebraska and Maine use a winner-take-all allocation of electors. From an information-theoretic point of view, this process utilizes information reduction in order to force a decision.

A second example regards the associative storage of our brain, the neocortex. The neurons of the neocortex are arranged in structures called neocortical columns. These columns are similar and can be thought of as the basic repeating functional units of the neocortex. In humans, the neocortex consists of about half a million of these columns, each of which contains approximately 60,000 neurons.

FIG. 4 illustrates the analogy between biological neural networks and the text mining networks. For example, a single computer that stores 60,000 documents represents a cortical column. A network of 100 computers thus stores 6 million documents. Each computing device (i.e. cortical column) operates as an autonomous unit. Given a certain query, each computer utilizes a neural network that computes associated terms.

Typically, a term can be associated with more than 10,000 other terms. These associated terms are represented as vectors. Without the use of any preprocessing, the sum of all vectors has to be calculated. Therefore, each computer has to transfer a huge amount of data. Each associated term is assigned two numerical values: an index value (e.g. the term “Armstrong” is term number 134593) and a weight value (e.g. “Armstrong” has an association value of 0.845 with respect to the search query). In the example of 100 computers, approximately 2 million numerical values therefore have to be transferred and calculated. In particular, a global index for the word vectors would be necessary. These considerations illustrate why neural networks have not been applied for internet search engines.

In contrast, embodiments of the present invention, which employ neuronal preprocessing steps, offer performance advantages over known neural network schemes. In the preferred embodiment, each computer transfers only the Top-20 associations (the 20 best associations) for each search query. In the example of a system comprising 100 computers, only 2000 numerical values have to be added. Moreover, the utilization of the virtual index algorithm of the present invention avoids the necessity of using a global index for the text mining network. In fact, in a text mining network of the present invention the total number of unique words remains unknown. Thus, the approach of the system and method of the present invention is fundamentally different from known state-of-the-art solutions that utilize algorithms like K-means or SOM, which calculate global word vectors.
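The effect of transferring only the 20 best associations can be sketched as follows. This is an illustrative sketch, not the preprocessing used by the invention: the function name top_associations is hypothetical, the association list is synthetic, and heapq.nlargest is merely one convenient way to select the best entries.

```python
import heapq


def top_associations(associations, k=20):
    """Keep only the k (virtual index, weight) pairs with the highest weight."""
    return heapq.nlargest(k, associations, key=lambda pair: pair[1])


# Synthetic example: one device with 10,000 associated terms for a query.
all_terms = [(i, 1.0 / (i + 1)) for i in range(10_000)]
best = top_associations(all_terms)

computers = 100
values_without_preprocessing = computers * len(all_terms) * 2  # index + weight
weights_with_preprocessing = computers * len(best)             # weights to add

if __name__ == "__main__":
    print(values_without_preprocessing)  # 2000000
    print(weights_with_preprocessing)    # 2000
```

The two counts reproduce the figures given in the description: roughly 2 million numerical values without preprocessing versus 2000 weight values when each of 100 computers returns only its 20 best associations.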

The foregoing disclosure of the preferred embodiments of the present invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many variations and modifications of the embodiments described herein will be apparent to one of ordinary skill in the art in light of the above disclosure. The scope of the invention is to be defined only by the claims appended hereto, and by their equivalents.

Further, in describing representative embodiments of the present invention, the specification may have presented the method and/or process of the present invention as a particular sequence of steps. However, to the extent that the method or process does not rely on the particular order of steps set forth herein, the method or process should not be limited to the particular sequence of steps described. As one of ordinary skill in the art would appreciate, other sequences of steps may be possible. Therefore, the particular order of the steps set forth in the specification should not be construed as limitations on the claims. In addition, the claims directed to the method and/or process of the present invention should not be limited to the performance of their steps in the order written, and one skilled in the art can readily appreciate that the sequences may be varied and still remain within the spirit and scope of the present invention.

1. A text mining system comprising: a server device configured to receive search queries from one or more clients; and a network of autonomous computer entities each linked to the server and each configured to provide documents and associative queries based on an initial search query, wherein the server is configured to: assign relevance scores to the documents based on neural network algorithms that are implemented on each computer entity; assign a virtual index value to each associated query; calculate a merged associative query based on a weighted union of virtual index values received from two or more computer entities; and calculate a document result set based on the weighted union of document result sets from two or more computer entities.
2. The system of claim 1, wherein calculation of the document result set further includes forming a sub-set of documents from the initial set of documents from a particular computer entity by splitting one or more documents in parts due to the document length.
3. The system of claim 1, wherein a predefined number of documents defines a sub-set of documents that is used for the calculation of associated queries.
4. The system of claim 1, wherein the server device comprises more than one server.
5. The system of claim 1, wherein each of the two or more computer entities is assigned a specific document repository.
6. The system of claim 1, wherein each of the two or more computer entities comprises random access memory and a microprocessor.
7. The system of claim 1, wherein the virtual index is based on a 24-bit number scheme in which 24 different features of each word of a search query are checked.
8. The system of claim 1, wherein each computer entity of the network of computer entities returns only about 20 best associations to the server entity in response to a search query.
9. A method for providing to a user a document result set based on a search inquiry, comprising: receiving a search request at a server device; forwarding the search request to a plurality of computer devices; receiving, from each computer device: a plurality of associated terms related to the search request; a virtual index value for each associated term; and a weighted relevance of each associated term determined by the each computer device; assigning a document weight factor for each computer device; ranking terms received from the plurality of computer devices based on the document weight factor of each computer device; and providing to the user a ranked set of terms based on the ranking of the terms.
10. The method of claim 9, wherein the virtual index is based on a 24-bit number scheme in which 24 different features of each word of a search query are checked.
11. The method of claim 9, further comprising: assigning a separate document repository to each computer device of the plurality of computer devices; assigning a relative ranking of the each computer device according to information content (I) contained in the computer device.
12. The method of claim 11, wherein the information content (I) of a computer device (k) is given by I=log(N_(k)), where N is the number of different words in a local index of computer device k.
13. The method of claim 9, wherein the document weight factor is a maximum absolute relevance of any associated term of a given computer device.
14. The method of claim 9, wherein each computer device returns only about 20 best associations to the server entity in response to a search query.
15. The method of claim 9, wherein documents associated with each search term are stored locally in each computer device, and wherein each document can be accessed by the server device at an internet protocol (IP) address associated with the each document.
16. The method of claim 9, wherein providing to the user a ranked set of terms comprises providing a set of search results on a graphical display associated with a user computing device linked to the server device.
17. The method of claim 16, wherein the user computing device is linked to the server device over a data network connection.
18. The method of claim 9, wherein each computing device comprises a microprocessor and random access memory.
19. The method of claim 9, wherein the server device comprises more than one server.
20. The method of claim 9, wherein each computer device is an autonomous search engine.